LZP Data Compression
Introduction
Hi everybody! Are you new to data compression, or do you just want to learn about another algorithm? Then this document is exactly what you need. ;) In the next few minutes (assuming that you keep reading), I will introduce the LZP compression algorithm to you. "LZP?!", some of you might ask, "Shouldn't it be LZW?" - No, LZP is perfectly right. In fact, not many people know about this algorithm yet, maybe because it is still quite "young" (as far as I know, it was discovered only about two years ago). Nevertheless, LZP is very efficient, not only in terms of speed, but also in terms of compression, and it is extremely easy to implement. Therefore, it is probably the ultimate algorithm to be combined with other compression techniques. :)
Disclaimer
Everything I will tell you about LZP doesn't necessarily have to be correct. I could not find any useful documents covering LZP on the web, and the algorithm was only explained to me by some friends. So I have to write everything straight from my brain, which might lead to mistakes. Anyway, I'm pretty sure that at least some of the stuff I remember is correct, as the included example programs seem to work quite well. ;) If you encounter any bugs, though, send me an email. You can find the address at the end of this article.
LZP - Lempel Ziv Penis (?!)
Actually, the idea behind LZP compression is not that new. Not at all, to be exact, as it is heavily based on LZ77, and a bit on LZ78.
To understand what LZP exactly does, we first have to take a quick look at how LZ77 compresses data. For a simple LZ77 implementation, just imagine the following:
You have a ring buffer of - say - 8192 bytes, which is initialized to zero. For compression, you read some bytes from your input stream and write them to the buffer. Then you search through the buffer for a string that matches as many of the bytes just read as possible. If you find one, you write its position in the buffer and its length to the compressed output stream; otherwise, you write an uncompressed character.
"Well, what's the problem?", you might ask. In our case, the ring buffer is 8192 bytes big, and for every match we find, we must encode both its position in the buffer, and its length. Considering the buffer size of 8192 bytes, we have to write 13 bits just to specify the position in the buffer our match begins, plus an additional few bits for the length of the match. Doesn't sound too efficient, eh? There are some ways to improve this, though, and LZP actually uses a very clever one...
LZP - Lempel Ziv + Prediction (!)
Do we really have to encode the position where a match starts? Can't we reduce the number of different match positions in some way, so that we do not have to write that many bits just for encoding the position? Actually, we can. (Did you expect anything else?? :)) In fact, we can even reduce the number of bits to be written for the position to zero.
But how can we do that? - It is much easier than you might imagine. We are going to use some sort of prediction. To predict whether the data at the current position of the input stream has already occurred before (whether we have a match, that is), we take the last n bytes read and combine them to form a hash index (I). A hash index is just a value used to access an array. This array - the hash table - is reset to zero at compression startup, and gets filled with pointers to possible starting positions of matches during compression.
Now that we have our hash index and our hash table, we take the pointer from hash_table[I] and store it in a variable (M). Then we update the hash table by writing a pointer to the current position of the input stream into hash_table[I]. If the pointer (M) we read before actually points somewhere (= it is not zero), we have possibly found a position in the preceding part of the input stream which matches the input stream at our current position.
We then try to determine the length of the match by comparing the data at our current position of the input stream with the data our "probable-match" pointer (M) points to. If we have actually found a match, we store a flag indicating that we found a match, followed by the length of the match, to the output stream. If we could not find a match, we store a flag indicating that a single byte follows, followed by the current byte of the input stream. If the pointer (M) didn't point anywhere, we can be sure that there is no match and safely write just the current character of the input stream to the output stream - without any flag, because the decompressor's hash table is empty at the very same spot, so it already knows that a raw byte must follow. :)
In pseudo-code, this looks something like this:
while (input_bytes > 0) {
    input_ptr = pointer_to_input_stream;
    prebytes  = pointer_to_(input_stream - n);          // the last n bytes read
    I = calculate_hash_index(prebytes);
    M = get pointer from hash_table[I];                 // predicted match position
    hash_table[I] = input_ptr;                          // update the table
    if (M != 0) {
        match_length = get length by comparing input_ptr-data with M-data;
        if (match_length >= 1) {
            write_match_found_flag;
            write_match_length;
            input_bytes = input_bytes - match_length;
            pointer_to_input_stream = pointer_to_input_stream + match_length;
        } else {
            write_match_not_found_flag;
            write_character;
            input_bytes = input_bytes - 1;
            pointer_to_input_stream = pointer_to_input_stream + 1;
        }
    } else {
        write_character;                                // no flag needed here
        input_bytes = input_bytes - 1;
        pointer_to_input_stream = pointer_to_input_stream + 1;
    }
}
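If you prefer real code over pseudo-code, here is a minimal, runnable C++ sketch of that loop. Note that the bit-level output is replaced by simple printf() logging, and MAX_MATCH as well as the toy hash function are my own choices - see LZP1.CPP for the real thing:

#include <cstdio>
#include <cstring>
#include <cstddef>

const size_t HASH_SIZE = 1 << 12;   // 4096 entries
const unsigned MAX_MATCH = 63;      // fits the 6-bit length codes of LZP1

// A toy hash over the last 3 bytes; any such function works
// (LZP1's actual hash is shown in "The first implementation" below).
static size_t hash_index(const unsigned char *p) {
    return (p[-1] ^ (p[-2] << 3) ^ (p[-3] << 6)) & (HASH_SIZE - 1);
}

static void lzp_compress(const unsigned char *in, size_t size) {
    static const unsigned char *hash_table[HASH_SIZE];  // zeroed at startup
    size_t pos = 0;

    while (pos < size) {
        if (pos < 3) {                           // no context yet: raw literal
            printf("literal '%c'\n", in[pos]);
            pos++;
            continue;
        }
        size_t i = hash_index(in + pos);
        const unsigned char *m = hash_table[i];  // M: predicted match position
        hash_table[i] = in + pos;                // update the table

        if (m != 0) {
            unsigned len = 0;                    // compare to get the match length
            while (len < MAX_MATCH && pos + len < size && m[len] == in[pos + len])
                len++;
            if (len >= 1) {
                printf("flag 1, match of length %u\n", len);
                pos += len;
            } else {
                printf("flag 0, literal '%c'\n", in[pos]);
                pos++;
            }
        } else {
            // M was zero: the decompressor knows that too,
            // so the literal needs no flag at all.
            printf("literal '%c'\n", in[pos]);
            pos++;
        }
    }
}

int main() {
    const char *text = "the quick brown fox ... the quick brown fox";
    lzp_compress((const unsigned char *)text, strlen(text));
    return 0;
}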
That's (almost) everything you have to know. It's so easy. :)
Some words about hashing
Basically, the hash index can be calculated any way you want. The sicker your index function is, the better the compression will be. ;) However, I was told (and have also experienced myself) that standard xor-style hashing produces the best compression ratios. Be careful, though, that your hash index never gets bigger than the size of the hash table (just mask it with the table size minus one), because out-of-bounds accesses will make your program crash. :)
The first implementation
For the first version of our little LZP compression program we are going to use a 12-bit hash table (4096 entries). The hash function we are going to use is hash_index(i) = ((i >> 11) ^ i) & 4095, where "i" is built from the preceding 3 bytes of the input buffer. The matches will be encoded as fixed-length codes of 6 bits. You can see the full source of such a lame implementation in the included file LZP1.CPP.
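Written out in C++, that function might look like this (exactly how the 3 preceding bytes are packed into "i" is my assumption; the shift and mask come from the formula above):

#include <cstdint>
#include <cstddef>

// LZP1's hash: pack the 3 bytes before the current position into i,
// then xor-fold them down to a 12-bit index.
size_t hash_index(const unsigned char *p) {
    uint32_t i = ((uint32_t)p[-3] << 16) | ((uint32_t)p[-2] << 8) | p[-1];
    return ((i >> 11) ^ i) & 4095;   // always smaller than the table size
}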
First improvements
A larger hash table might be handy for bigger files, though it will slightly hurt compression on smaller files. LZP2.CPP demonstrates the usage of a 16-bit hash table (65536 entries). This time, 4 bytes are used to generate the hash index, and the hash function is hash_index(i) = ((i >> 15) ^ i) & 65535.
The third try
Encoding the match lengths using fixed-length codes sucks. A much better approach is using variable-length integer codes for this. The algorithm LZP3.CPP uses looks something like this:
if (match_length >= 3) {
    write_to_output(3, 2);                  // write_to_output(value, number_of_bits);
    match_length -= 3;
    if (match_length >= 7) {
        write_to_output(7, 3);
        match_length -= 7;
        if (match_length >= 31) {
            write_to_output(31, 5);
            match_length -= 31;
            if (match_length >= 0xff) {
                // very long match: emit 0xff markers until the rest fits into one byte
                while (match_length >= 0xff) {
                    write_to_output(0xff, 8);
                    match_length -= 0xff;
                }
                write_to_output(match_length, 8);
            } else write_to_output(match_length, 8);
        } else write_to_output(match_length, 5);
    } else write_to_output(match_length, 3);
} else write_to_output(match_length, 2);
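For completeness, this is how a matching decoder could look in C++ (the MSB-first bit reader is a toy of my own; the real routines live in LZP3.CPP and UNLZP3.CPP):

#include <cstdio>
#include <cstdint>
#include <cstddef>

// A toy bit reader that returns bits in MSB-first order.
struct BitReader {
    const uint8_t *data;
    size_t bitpos;
    unsigned read_bits(unsigned n) {
        unsigned v = 0;
        while (n--) {
            v = (v << 1) | ((data[bitpos >> 3] >> (7 - (bitpos & 7))) & 1);
            bitpos++;
        }
        return v;
    }
};

// Decode a match length written by the scheme above: each code level
// is read, and only if it has its maximum value does the next level follow.
unsigned read_match_length(BitReader &in) {
    unsigned len = in.read_bits(2);
    if (len < 3) return len;
    unsigned v = in.read_bits(3);
    len += v;
    if (v < 7) return len;
    v = in.read_bits(5);
    len += v;
    if (v < 31) return len;
    do {                                  // run of 0xff bytes, then the rest
        v = in.read_bits(8);
        len += v;
    } while (v == 0xff);
    return len;
}

int main() {
    // match_length 10 encodes as 3 (2 bits), 7 (3 bits), 0 (5 bits):
    // the bits 11 111 00000 packed MSB-first give 0xF8 0x00.
    const uint8_t buf[] = { 0xF8, 0x00 };
    BitReader r = { buf, 0 };
    printf("decoded length: %u\n", read_match_length(r));   // prints 10
    return 0;
}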
Decompression
Decompressing LZP-compressed data uses almost the same algorithm as compressing it, which makes this a really easy task. :) It shouldn't be too difficult for you to implement it on your own, but I'll give you the pseudo-code for decompression anyway, plus ready-made decompression routines for all three examples (UNLZP1.CPP - UNLZP3.CPP).
Decompression pseudo-code:
do {
    output_ptr = pointer_to_destination_buffer;
    prebytes   = pointer_to_(destination_buffer - n);   // the last n bytes written
    I = calculate_hash_index(prebytes);
    M = get pointer from hash_table[I];
    hash_table[I] = output_ptr;
    if (M != 0) {
        prefix = get one bit from input stream;
        switch (prefix) {
            case 0 : *destination_buffer++ = get 8 bits from input stream;
                     break;
            case 1 : match_length = get match length from input stream;
                     for (i = 1; i <= match_length; i++)
                         *destination_buffer++ = *M++;
                     break;
        }
    } else *destination_buffer++ = get 8 bits from input stream;   // no prefix bit needed
} while (input_stream_not_empty);
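To see compressor and decompressor run in lockstep, here is a small, runnable round-trip sketch in C++. I cheat a little and pass the tokens through a std::vector instead of a real bit stream, so no bit I/O is needed; all names are my own, and UNLZP1.CPP - UNLZP3.CPP contain the real routines:

#include <cstdio>
#include <cstring>
#include <cstddef>
#include <cstdint>
#include <vector>

// length == 0 means 'byte' is a literal; length > 0 means a match.
struct Token { int length; unsigned char byte; };

const size_t HASH_SIZE = 1 << 12;

static size_t hash_index(const unsigned char *p) {
    uint32_t i = ((uint32_t)p[-3] << 16) | ((uint32_t)p[-2] << 8) | p[-1];
    return ((i >> 11) ^ i) & (HASH_SIZE - 1);
}

static std::vector<Token> compress(const unsigned char *in, size_t size) {
    const unsigned char *table[HASH_SIZE] = { 0 };
    std::vector<Token> out;
    size_t pos = 0;
    while (pos < size) {
        if (pos < 3) { out.push_back({ 0, in[pos] }); pos++; continue; }
        size_t i = hash_index(in + pos);
        const unsigned char *m = table[i];
        table[i] = in + pos;
        int len = 0;
        if (m)
            while (len < 63 && pos + len < size && m[len] == in[pos + len])
                len++;
        if (len >= 1) { out.push_back({ len, 0 }); pos += len; }
        else          { out.push_back({ 0, in[pos] }); pos++; }
    }
    return out;
}

// The decompressor repeats the very same prediction: hash the last
// 3 OUTPUT bytes, fetch and update the table exactly like the
// compressor did, then either copy from the prediction or append a literal.
static size_t decompress(const std::vector<Token> &in, unsigned char *out) {
    unsigned char *table[HASH_SIZE] = { 0 };
    unsigned char *dst = out;
    for (size_t t = 0; t < in.size(); t++) {
        if (dst - out >= 3) {
            size_t i = hash_index(dst);
            unsigned char *m = table[i];
            table[i] = dst;
            if (in[t].length > 0) {              // match: copy byte by byte
                for (int k = 0; k < in[t].length; k++) *dst++ = *m++;
                continue;
            }
        }
        *dst++ = in[t].byte;                     // literal
    }
    return (size_t)(dst - out);
}

int main() {
    const char *s = "the quick brown fox ... the quick brown fox";
    std::vector<Token> tokens = compress((const unsigned char *)s, strlen(s));
    unsigned char buf[256];
    size_t n = decompress(tokens, buf);
    printf("%.*s\n", (int)n, (const char *)buf); // prints the original text
    return 0;
}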
Comparison
Type of file | Size (bytes) | LZP1 | LZP2 | LZP3 | PKZIP -ex
-------------------------------------------------------------------------
word .doc | 6522368 | 1974863 | 1371824 | 950879 | 1198215
word .doc | 1121196 | 404130 | 306190 | 241755 | 244069
ascii .txt | (*) 903 | 746 | 785 | 764 | 445
ascii .txt | 6636 | 4652 | 4865 | 4638 | 2592
dos .exe | 49358 | 41122 | 40043 | 39088 | 28672
-------------------------------------------------------------------------
Considering that PKZIP's deflate algorithm is much more complex than our LZP implementation, the results are quite good. I was especially surprised by the fact that the big Word documents are compressed that well by LZP3.CPP's technique. :)
(*) ... so LZP is not very good for 4k intros or Hugi-Compo entries. ;)
Conclusion
As you can see, LZP is a very powerful algorithm for data compression. It is probably one of the fastest out there, and its compression is good, too. And don't forget: the examples I gave you don't show the real power of LZP, but rather demonstrate how the algorithm works. If you, for example, combine them with Huffman coding, you quickly get a compression tool which beats PKZIP in both compression and speed. My archive manager uses such techniques (well, I don't use Huffman, but rather... Nah, I won't tell you now. ;)) and I'm quite satisfied with the performance. :)
If you still have any questions, or if you have any experience with LZP compression you want to share with me, or if you just want to talk about anything concerning code, send me an email: mmichels@nrh.de
Thank you for reading, and good luck!